38 research outputs found

    Adaptive Analysis and Processing of Structured Multilingual Documents

    Digital document processing is becoming popular for application to office and library automation, bank and postal services, publishing houses and communication management. In recent years, the demand for tools capable of searching written and spoken sources of multilingual information has increased tremendously, and the bilingual dictionary is one of the important resources for providing the required information. Processing and analysis of bilingual dictionaries brings up the challenge of dealing with many different scripts, some of which are unknown to the designer. A framework is presented to adaptively analyze and process structured multilingual documents, where adaptability is applied to every step. The proposed framework involves: (1) General word-level script identification using a Gabor filter. (2) Font classification using the grating cell operator. (3) General word-level style identification using a Gaussian mixture model. (4) An adaptable Hindi OCR based on generalized Hausdorff image comparison. (5) Retargetable OCR with automatic training sample creation and its applications to different scripts. (6) Bootstrapping entry segmentation, which segments each page into functional entries for parsing. Experimental results on different scripts, such as Chinese, Korean, Arabic, Devanagari, and Khmer, demonstrate that the proposed framework can significantly reduce human effort by making each phase adaptive.

    PARSING AND TAGGING OF BILINGUAL DICTIONARY

    Bilingual dictionaries hold great potential as a source of lexical resources for training and testing automated systems for optical character recognition, machine translation, and cross-language information retrieval. In this paper, we describe a system for extracting term lexicons from printed bilingual dictionaries. Our work was divided into three phases: dictionary segmentation, entry tagging, and generation. In segmentation, pages are divided into logical entries based on structural features learned from selected examples. The extracted entries are associated with functional labels and passed to a tagging module, which associates linguistic labels with each word or phrase in the entry. The output of the system is a structure that represents the entries from the dictionary. We have used this approach to parse a variety of dictionaries with both Latin and non-Latin alphabets, and demonstrate the results of term lexicon generation for retrieval from a collection of French news stories using English queries. (LAMP-TR-106) (CAR-TR-991) (UMIACS-TR-2003-97)

    A physics-constrained machine learning method for mapping gapless land surface temperature

    More accurate, spatio-temporally and physically consistent land surface temperature (LST) estimation has been a main interest in Earth system research. Physics-driven mechanism models and data-driven machine learning (ML) models are the two major paradigms for gapless LST estimation, each with its own advantages and disadvantages. In this paper, a physics-constrained ML model, which combines the strengths of the mechanism model and the ML model, is proposed to generate gapless LST with physical meaning and high accuracy. The hybrid model employs ML as the primary architecture, under which physical constraints on the input variables are incorporated to enhance the interpretability and extrapolation ability of the model. Specifically, the light gradient-boosting machine (LGBM) model, which uses only remote sensing data as input, serves as the pure ML model. Physical constraints (PCs) are coupled by further incorporating key Community Land Model (CLM) forcing data (cause) and CLM simulation data (effect) as inputs into the LGBM model. This integration forms the PC-LGBM model, which incorporates surface energy balance (SEB) constraints underlying the data in CLM-LST modeling within a biophysical framework. Compared with a pure physical method and pure ML methods, the PC-LGBM model improves the prediction accuracy and physical interpretability of LST. It also demonstrates a good extrapolation ability for the responses to extreme weather cases, suggesting that the PC-LGBM model not only learns empirically from data but is also rationally derived from theory. The proposed method represents an innovative way to map accurate and physically interpretable gapless LST, and could provide insights to accelerate knowledge discovery in land surface processes and data mining in geographical parameter estimation.

    Adaptive Hindi OCR Using Generalized Hausdorff Image Comparison

    We present an adaptive Hindi OCR implemented as part of a rapidly retargetable language tool effort. The system includes script identification, character segmentation, training sample creation and character recognition. In the script identification step, Hindi words are identified from bilingual or multilingual documents based on features of the Devanagari script or using a Support Vector Machine (SVM). Identified words are then segmented into individual characters in the next step, where the composite characters are identified and further segmented based on the structural properties of the script and statistical information. Segmented characters are recognized using generalized Hausdorff image comparison, and post-processing is applied to improve the performance. The OCR system (designed and implemented in one month) was applied to a complete Hindi-English bilingual dictionary and a set of ideal images extracted from Hindi documents in PDF format. Experimental results show the recognition accuracy can reach 88% for noisy images and 95% for ideal images. The presented method can also be extended to design OCR systems for different scripts.
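The abstract above does not spell out its generalized Hausdorff measure. As a rough illustration only, the sketch below uses one common generalization, the ranked (partial) Hausdorff distance, in which the maximum over nearest-neighbor distances is replaced by a quantile so that a few noisy pixels do not dominate. The function names and the `fraction` parameter are hypothetical and are not taken from the paper.

```python
import math

def directed_hausdorff(a, b, fraction=1.0):
    """Ranked directed Hausdorff distance from point set a to point set b.

    fraction=1.0 gives the classical directed distance (the maximum);
    smaller values ignore the worst-matching points (assumed variant).
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Distance from each point of a to its nearest neighbor in b, ascending.
    nearest = sorted(min(math.dist(p, q) for q in b) for p in a)
    # Pick the k-th ranked value instead of the maximum.
    k = max(1, math.ceil(fraction * len(nearest)))
    return nearest[k - 1]

def hausdorff(a, b, fraction=1.0):
    """Symmetric version: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b, fraction),
               directed_hausdorff(b, a, fraction))

# Foreground pixel coordinates of a template glyph and a noisy copy of it.
template = [(0, 0), (0, 1), (1, 0), (1, 1)]
noisy = template + [(10, 10)]  # one stray noise pixel

classical = hausdorff(template, noisy)        # dominated by the stray pixel
ranked = hausdorff(template, noisy, 0.75)     # stray pixel is ignored
```

With `fraction=0.75` the single outlier no longer affects the score, which is the kind of noise tolerance a character recognizer working on scanned pages would need.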

    Word level script identification for scanned document images

    In this paper, we compare the performance of three classifiers used to identify the script of words in scanned document images. In both training and testing, a Gabor filter is applied and 16 channels of features are extracted. Three classifiers (Support Vector Machines (SVM), Gaussian Mixture Model (GMM) and k-Nearest-Neighbor (k-NN)) are used to identify different scripts at the word level (glyphs separated by white space). These three classifiers are applied to a variety of bilingual dictionaries and their performance is compared. Experimental results show the capability of the Gabor filter to capture script features and the effectiveness of these three classifiers for script identification at the word level.
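To make the 16-channel Gabor feature extraction mentioned above more concrete, here is a minimal sketch. The abstract does not say how the 16 channels are laid out, so the split into 4 wavelengths by 4 orientations, the kernel size, the choice of sigma, and the mean-absolute-response feature are all assumptions made for illustration.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0):
    """Real (cosine) Gabor kernel as a size x size list of lists; size must be odd."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier wave oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def channel_energy(image, kernel):
    """Mean absolute filter response over all fully overlapping positions."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    total, count = 0.0, 0
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            r = sum(kernel[u][v] * image[i + u][j + v]
                    for u in range(k) for v in range(k))
            total += abs(r)
            count += 1
    return total / count

def gabor_features(image, wavelengths=(2, 4, 6, 8), n_orient=4, size=7):
    """16 features: one energy value per (wavelength, orientation) channel (assumed layout)."""
    feats = []
    for wl in wavelengths:
        for o in range(n_orient):
            theta = o * math.pi / n_orient
            kernel = gabor_kernel(size, wl, theta, sigma=0.5 * wl)
            feats.append(channel_energy(image, kernel))
    return feats

# Toy "word image": vertical stripes with a period of 4 pixels.
stripes = [[1.0 if (j % 4) < 2 else 0.0 for j in range(12)] for _ in range(12)]
features = gabor_features(stripes)
```

The resulting feature vector could then be passed to any of the three classifiers the abstract compares. For the striped toy image, the channel tuned to wavelength 4 with horizontal oscillation responds far more strongly than the channel at the same wavelength rotated by 90 degrees.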

    Font Identification Using the Grating Cell Texture Operator

    In this paper, a new feature extraction operator, the grating cell operator, is applied to analyze the texture features and classify fonts of scanned document images. This operator is compared with the isotropic Gabor filter, which was also employed for font classification. In order to improve the performance, a back-propagation neural network (BPNN) classifier was applied and compared with the simple weighted Euclidean distance (WED) classifier. Experimental results for five fonts of three scripts show that the grating cell operator performs better than the isotropic Gabor filter, and the BPNN classifier can provide more accurate classification results than the WED classifier.
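For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a weighted Euclidean distance classifier. It assumes the usual formulation in which each squared feature difference is divided by the per-class variance of that feature; the abstract does not give the exact weighting, and the function names, the `eps` guard, and the toy feature values are hypothetical.

```python
def fit_class_stats(samples_by_class):
    """Per-class mean and variance of every feature dimension."""
    stats = {}
    for label, samples in samples_by_class.items():
        n = len(samples)
        dims = len(samples[0])
        means = [sum(s[d] for s in samples) / n for d in range(dims)]
        variances = [sum((s[d] - means[d]) ** 2 for s in samples) / n
                     for d in range(dims)]
        stats[label] = (means, variances)
    return stats

def classify_wed(feature, stats, eps=1e-9):
    """Return the label whose variance-weighted distance to `feature` is smallest."""
    def wed(means, variances):
        # eps avoids division by zero for features with no spread (assumption).
        return sum((f - m) ** 2 / (v + eps)
                   for f, m, v in zip(feature, means, variances))
    return min(stats, key=lambda label: wed(*stats[label]))

# Made-up two-dimensional texture features for two font classes.
training = {
    "serif": [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5]],
    "sans": [[3.0, 5.0], [3.1, 5.2], [2.9, 4.8]],
}
stats = fit_class_stats(training)
label_a = classify_wed([1.1, 5.1], stats)
label_b = classify_wed([3.0, 5.0], stats)
```

Because the classifier only stores a mean and a variance per class, it is cheap to train, which is consistent with the abstract describing it as the simple alternative to the neural network.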

    NHC–AuCl/Selectfluor: A Highly Efficient Catalytic System for Carbene-Transfer Reactions

    The combination of NHC–gold complex and Selectfluor has been found to be a highly efficient catalyst system for carbene-transfer reactions, with a turnover number (TON) up to 990000 and a turnover frequency (TOF) up to 82500 h⁻¹.